Mixtral scaling: Reduce perplexity from 4.294 to 4.269 #301

Open
wants to merge 10 commits into main

Conversation

casper-hansen
Owner

This PR reduces the perplexity of Mixtral by introducing scaling of individual experts instead of one scale for all the experts in the MoE block.

To use this in vLLM, GGUF, or other integrations, their code needs to be updated to load the new ScaledMixtralSparseMoeBlock. The only modification needed in the inference code is right before the experts get the current_hidden_states:

current_state = hidden_states[None, top_x_list].reshape(
    -1, hidden_dim) / self.scales[expert_idx] # <-- scales
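
For reference, here is a minimal sketch of what such a block could look like; aside from the scales buffer and the division, the names and structure are illustrative only, not the exact code from this PR:

import torch
import torch.nn as nn

class ScaledMoeBlockSketch(nn.Module):
    # Illustrative only: one scale vector per expert instead of a single
    # scale for the whole MoE block.
    def __init__(self, num_experts: int, hidden_dim: int, experts: nn.ModuleList):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.experts = experts
        # Stored as a buffer so the scales are saved and loaded with the checkpoint.
        self.register_buffer("scales", torch.ones(num_experts, hidden_dim))

    def run_expert(self, hidden_states, top_x_list, expert_idx):
        # Same gather/reshape as the upstream Mixtral block, plus the division
        # by this expert's own scale (the line quoted above).
        current_state = hidden_states[None, top_x_list].reshape(
            -1, self.hidden_dim) / self.scales[expert_idx]
        return self.experts[expert_idx](current_state)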

Idea by @Sakits, engineering & implementation by @casper-hansen

@casper-hansen
Owner Author

CC @younesbelkada. Not sure if this would break anything in the transformers integration. WDYT?

@younesbelkada left a comment
Collaborator

Thanks @casper-hansen !
Unless I am missing something, I think this should be all good for the transformers integration for now. In other words, this PR is fully backward compatible with the transformers integration, since it does not seem to touch any core module that we use (GEMM_Linear). If we want to upstream this into transformers, we need to think about an API for users to replace the SparseMoE layers with ScaledMixtralSparseMoeBlock.
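
For example, such an API could be a simple module swap. This is only a sketch: the helper name replace_moe_blocks is hypothetical, and it assumes the scaled block can be constructed from the original one:

import torch.nn as nn

def replace_moe_blocks(model: nn.Module, scaled_block_cls):
    # Swap every MixtralSparseMoeBlock in-place for its scaled counterpart.
    # Assumes scaled_block_cls(original_block) copies the router/expert weights.
    for parent in model.modules():
        for child_name, child in parent.named_children():
            if child.__class__.__name__ == "MixtralSparseMoeBlock":
                setattr(parent, child_name, scaled_block_cls(child))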

@younesbelkada
Collaborator

As a sanity check, I would run basic inference with transformers after merging this PR just to be sure, but looking at the PR it does not seem to do anything that would break the transformers integration.

@casper-hansen
Owner Author

As a sanity check, I would run basic inference with transformers after merging this PR just to be sure, but looking at the PR it does not seem to do anything that would break the transformers integration.

Right, but we do introduce a scale in the MoE layer that needs to be loaded in order to run inference. Would it be possible to add this during the loading of the linear layers?

FYI, for MPT models, we need the same thing but just for a different layer.
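
For what it's worth, a sketch of how the scales could be picked up during weight loading; the function name and the assumption that they live under a ".scales" key in the checkpoint are mine, not the PR's actual loader:

import torch

def load_expert_scales(model, state_dict):
    # Copy any per-expert "scales" tensors from the checkpoint into the
    # already-replaced MoE blocks.
    with torch.no_grad():
        for name, module in model.named_modules():
            key = f"{name}.scales" if name else "scales"
            if hasattr(module, "scales") and key in state_dict:
                module.scales.copy_(state_dict[key])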

@vince62s

The ppl improvement is really small. Did you try other scores to see if this is worth it?

@casper-hansen
Owner Author

The ppl improvement is really small. Did you try other scores to see if this is worth it?

This is how it should have been implemented from the start, as it is a more correct implementation of AWQ. Yes, I have tried other combinations, but they do not lead to better perplexity (at least not the ones I tried).
